

Entropy testing and its application to testing Bayesian networks

Neural Information Processing Systems

This paper studies the problem of entropy identity testing: given sample access to a distribution p and a fully described distribution q (both discrete distributions over a domain of size k), and the promise that either p = q or |H(p) − H(q)| ≥ ε, where H(·) denotes the Shannon entropy, a tester needs to distinguish between the two cases with high probability.
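As a toy illustration of the testing problem (not the paper's tester), one can compare a plug-in entropy estimate from samples of p against the exactly computable H(q); the function names and the ε/2 acceptance threshold below are illustrative assumptions:

```python
import math
import random
from collections import Counter

def shannon_entropy(dist):
    """Exact Shannon entropy (in nats) of a fully described distribution."""
    return -sum(q * math.log(q) for q in dist if q > 0)

def entropy_identity_test(samples, q, eps):
    """Toy tester (hypothetical, not the paper's algorithm): estimate H(p)
    with the plug-in estimator and accept 'p = q' iff the estimate lies
    within eps/2 of the known H(q)."""
    n = len(samples)
    counts = Counter(samples)
    h_hat = -sum((c / n) * math.log(c / n) for c in counts.values())
    return abs(h_hat - shannon_entropy(q)) < eps / 2

random.seed(0)
q = [0.25, 0.25, 0.25, 0.25]  # uniform over domain of size k = 4
same = random.choices(range(4), weights=q, k=20000)
skew = random.choices(range(4), weights=[0.85, 0.05, 0.05, 0.05], k=20000)
print(entropy_identity_test(same, q, eps=0.3))  # p = q case
print(entropy_identity_test(skew, q, eps=0.3))  # entropy gap ~0.8 nats
```

The plug-in estimator is naive; the point of the paper's setting is precisely that smarter testers need far fewer samples than plug-in estimation over large k.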



Adaptive Linear Estimating Equations

Neural Information Processing Systems

Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such a data collection mechanism often introduces complexities into the statistical inference procedure.





Simultaneous Approximation of the Score Function and Its Derivatives by Deep Neural Networks

Yakovlev, Konstantin, Puchkin, Nikita

arXiv.org Machine Learning

Score estimation, the task of learning the gradient of the log density, has become a crucial part of generative diffusion models [Song and Ermon, 2019, Song et al., 2021]. These models achieve state-of-the-art performance in a wide range of domains including image, audio, and video synthesis [Dhariwal and Nichol, 2021, Kong et al., 2021, Ho et al., 2022]. To sample from the desired distribution, one needs an accurate score function estimator along the Ornstein-Uhlenbeck process. In the context of diffusion models, score estimation is done by minimizing the denoising score matching loss over a class of neural networks [Song et al., 2021, Vincent, 2011, Oko et al., 2023]. Another recipe for score estimation is implicit score matching, proposed by Hyvärinen [2005]; its objective includes not only the score function but also its Jacobian trace. A crucial research question is to determine the iteration complexity of distribution estimation given an inaccurate score function. The convergence theory of diffusion models has received much attention in recent years. Some works [De Bortoli, 2022, Chen et al., 2023b, Benton et al., 2024, Li and Yan, 2024] study SDE-based samplers under the assumption that the score estimator is L
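The denoising score matching recipe mentioned above can be illustrated on a toy 1-D Gaussian, where the regression target and the optimal linear score model are known in closed form. This is an illustrative sketch only, not the paper's construction; the function name, the linear score model, and the closed-form least-squares fit are all assumptions made for the example:

```python
import random

def dsm_fit_linear(n=100000, sigma=0.5, seed=1):
    """Denoising score matching on a toy 1-D Gaussian.
    Data x ~ N(0,1); noised sample x_t = x + sigma * z with z ~ N(0,1).
    DSM regresses a score model s(x_t) = w * x_t onto the target
    (x - x_t) / sigma^2. The population minimizer is the score of the
    noised density N(0, 1 + sigma^2), i.e. w* = -1 / (1 + sigma^2)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        xt = x + sigma * rng.gauss(0.0, 1.0)
        target = (x - xt) / sigma**2  # DSM regression target
        num += xt * target            # closed-form least squares for w
        den += xt * xt
    return num / den

w = dsm_fit_linear()
print(w, -1 / (1 + 0.5**2))  # fitted w should approach -0.8
```

In real diffusion models the linear model is replaced by a neural network and the objective is averaged over noise levels, but the regression target is the same.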


Batched Thompson Sampling

Neural Information Processing Systems

We introduce a novel anytime batched Thompson sampling policy for multi-armed bandits where the agent observes the rewards of her actions and adjusts her policy only at the end of a small number of batches. We show that this policy simultaneously achieves a problem-dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$ while the number of batches can be bounded by $O(\log(T))$ independent of the problem instance over a time horizon $T$. We also prove that in expectation the instance-dependent batch complexity of our policy is of order $O(\log\log(T))$. These results indicate that Thompson sampling performs competitively with recently proposed algorithms for the batched setting, which optimize the batch structure for a given time horizon $T$ and prioritize exploration in the beginning of the experiment to eliminate suboptimal actions. Unlike these algorithms, the batched Thompson sampling algorithm we propose is an anytime policy, i.e., it operates without the knowledge of the time horizon $T$, and as such it is the only anytime algorithm that achieves optimal regret with $O(\log\log(T))$ expected batch complexity. This is achieved through a dynamic batching strategy, which uses the agent's estimates to adaptively increase the batch duration.
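A minimal sketch of the batched setting for Bernoulli bandits is below. One hedged simplification: the paper's dynamic, estimate-driven batching rule is replaced here by plain geometric doubling of the batch length (which gives $O(\log(T))$ batches, not the $O(\log\log(T))$ expected batch complexity of the actual policy); all names and constants are illustrative:

```python
import random

def batched_thompson(means, horizon=5000, seed=0):
    """Batched Thompson sampling sketch for Bernoulli bandits.
    Beta(1+successes, 1+failures) posteriors are frozen at the start of
    each batch and updated only at its end; the batch length doubles
    every batch (a simplification of the paper's dynamic rule)."""
    rng = random.Random(seed)
    k = len(means)
    succ, fail = [0] * k, [0] * k
    t, batch_len, batches, pulls = 0, 1, 0, [0] * k
    while t < horizon:
        # Snapshot the posterior: within a batch the policy cannot adapt.
        snap = [(1 + succ[a], 1 + fail[a]) for a in range(k)]
        new_s, new_f = [0] * k, [0] * k
        for _ in range(min(batch_len, horizon - t)):
            draws = [rng.betavariate(*snap[a]) for a in range(k)]
            arm = max(range(k), key=lambda a: draws[a])
            r = 1 if rng.random() < means[arm] else 0
            new_s[arm] += r
            new_f[arm] += 1 - r
            pulls[arm] += 1
            t += 1
        for a in range(k):  # rewards are revealed only at the batch end
            succ[a] += new_s[a]
            fail[a] += new_f[a]
        batches += 1
        batch_len *= 2
    return batches, pulls

batches, pulls = batched_thompson([0.3, 0.7])
print(batches)  # roughly log2(horizon) batches
print(pulls)    # the 0.7 arm should dominate the pull counts
```

Doubling is the standard "static grid" baseline; the paper's contribution is precisely that the batch schedule can instead be driven by the agent's posterior estimates, anytime, without knowing $T$.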


Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

Aolaritei, Liviu, Jordan, Michael I.

arXiv.org Machine Learning

We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-\alpha$, with explicit bounds on the stopping time under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.
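The shape of such a certificate-driven stopping rule can be sketched as follows. Important hedge: the bound used here is the classical deterministic telescoping bound on the weighted-average suboptimality of projected SGD, computed from observed gradients, not the paper's supermartingale-based anytime-valid confidence sequence; all names and the toy problem are assumptions:

```python
import math
import random

def projected_sgd_with_certificate(grad, project, x0, D, eps,
                                   max_iter=200000, seed=0):
    """Projected SGD that stops when an observable certificate drops
    below eps (a simplified deterministic analogue of the paper's
    confidence sequence). Uses the classical bound on the weighted
    average suboptimality:
        gap_T <= (D^2/2 + (1/2) * sum_t eta_t^2 ||g_t||^2) / sum_t eta_t,
    where D bounds the domain diameter and g_t are observed gradients."""
    rng = random.Random(seed)
    x = x0
    sum_eta, sum_eta2_g2 = 0.0, 0.0
    for t in range(1, max_iter + 1):
        eta = 1.0 / math.sqrt(t)          # standard stepsize schedule
        g = grad(x, rng)                  # stochastic gradient oracle
        x = project(x - eta * g)          # projected SGD step
        sum_eta += eta
        sum_eta2_g2 += (eta * g) ** 2
        bound = (D**2 / 2 + sum_eta2_g2 / 2) / sum_eta
        if bound <= eps:                  # certified eps-optimal: stop
            return x, t, bound
    return x, max_iter, bound

# Toy problem: minimize E[(x - z)^2 / 2] with z ~ N(1, 0.1) over [-2, 2].
grad = lambda x, rng: x - rng.gauss(1.0, 0.1)
project = lambda x: max(-2.0, min(2.0, x))
x, t, bound = projected_sgd_with_certificate(grad, project, 0.0, D=4.0, eps=0.5)
print(t, round(x, 2))  # stops once the running certificate reaches eps
```

The certificate above holds only on average over the randomness, which is exactly the gap the paper closes: its confidence sequence is time-uniform and valid at the random stopping time itself.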


A Bayesian approach to learning mixtures of nonparametric components

Zhang, Yilei, Wei, Yun, Guha, Aritra, Nguyen, XuanLong

arXiv.org Machine Learning

Mixture models are widely used in modeling heterogeneous data populations. A standard approach of mixture modeling is to assume that the mixture component takes a parametric kernel form, while the flexibility of the model can be obtained by using a large or possibly unbounded number of such parametric kernels. In many applications, making parametric assumptions on the latent subpopulation distributions may be unrealistic, which motivates the need for nonparametric modeling of the mixture components themselves. In this paper we study finite mixtures with nonparametric mixture components, using a Bayesian nonparametric modeling approach. In particular, it is assumed that the data population is generated according to a finite mixture of latent component distributions, where each component is endowed with a Bayesian nonparametric prior such as the Dirichlet process mixture. We present conditions under which the individual mixture components' distributions can be identified, and establish posterior contraction behavior for the data population's density, as well as densities of the latent mixture components. We develop an efficient MCMC algorithm for posterior inference and demonstrate via simulation studies and real-world data illustrations that it is possible to efficiently learn complex distributions for the latent subpopulations. In theory, the posterior contraction rate of the component densities is nearly polynomial, which is a significant improvement over the logarithmic convergence rate of estimating mixing measures via deconvolution.
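The generative model described above (a finite mixture whose latent components are themselves flexible distributions) can be sketched as a data simulator. Here each "nonparametric" component is stood in for by a small finite Gaussian mixture; all weights, locations, and names are illustrative assumptions, not values from the paper:

```python
import random

def sample_population(n, seed=0):
    """Generative sketch: the population is a finite mixture of K = 2
    latent components, and each component is itself a flexible mixture
    (a finite Gaussian mixture standing in for a Dirichlet process
    mixture). Returns the observations and their latent labels."""
    rng = random.Random(seed)

    def component_0():  # bimodal latent subpopulation
        return rng.gauss(-3.0, 0.5) if rng.random() < 0.5 else rng.gauss(-1.0, 0.5)

    def component_1():  # right-skewed latent subpopulation, mean 2.4
        return rng.gauss(2.0, 0.4) if rng.random() < 0.8 else rng.gauss(4.0, 1.0)

    pi = 0.4  # mixing weight of component 0
    data, labels = [], []
    for _ in range(n):
        z = 0 if rng.random() < pi else 1
        data.append(component_0() if z == 0 else component_1())
        labels.append(z)
    return data, labels

data, labels = sample_population(10000)
print(sum(labels) / len(labels))  # fraction from component 1, near 0.6
```

The inference problem the paper addresses is the reverse direction: recovering the component densities and labels from `data` alone, without assuming the components' parametric form.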